Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON #553

ikawrakow · 2025-06-24T12:21:19Z

This PR corresponds to PRs #531, #533, #534, #546, #549, #550, #552, and applies the on-the-fly repacking technique to
the 1-bit quants IQ1_S and IQ1_M on ARM_NEON.

Here is a PP-512 performance comparison between the main branch and this PR for LLaMA-3.1-8B-Instruct on M2-Max:

| type | t/s (main) | t/s (PR) | Speedup |
| --- | ---: | ---: | ---: |
| IQ1_S | 66.3 | 168.8 | 2.546 |
| IQ1_M | 19.0 | 163.9 | 8.626 |

IQ1_M did not have a faster IQK implementation, so the 19 t/s is what one has within the standard ggml GEMM framework.
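
The Speedup column is the PR throughput divided by the main-branch throughput. A minimal sketch that re-derives it from the numbers in the table (the `results` dictionary and the `speedup` helper are illustrative names, not part of the PR):

```python
# PP-512 throughput in tokens/second, copied from the comparison table.
# Each entry is (main branch, this PR).
results = {
    "IQ1_S": (66.3, 168.8),
    "IQ1_M": (19.0, 163.9),
}


def speedup(main_tps: float, pr_tps: float) -> float:
    """Return the ratio of PR throughput to main-branch throughput."""
    return pr_tps / main_tps


for quant, (main_tps, pr_tps) in results.items():
    print(f"{quant}: {speedup(main_tps, pr_tps):.3f}x")
# prints:
# IQ1_S: 2.546x
# IQ1_M: 8.626x
```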

Iwan Kawrakow added 2 commits June 24, 2025 13:27

3c5b788 iq1_s: 66.3 t/s -> 168.8 t/s.
e5e5acf iq1_m: 19 t/s -> 163 t/s.

ikawrakow merged commit b5f2f00 into main Jun 24, 2025